Poster: StarbriteScanz Date: Aug 25, 2018 10:45am
Forum: faqs Subject: Re: Thank you thank you thank you! (+!!!!!)

As far as I can tell there's nothing in that document, at least at the bitmap resolution you uploaded, that is problematic. It has a neat, regimented column layout that makes first-pass area recognition by (at least) my OCR software quite accurate. One thing I did find is that OCR'ing the pages as-is, without changing the dpi, doesn't result in smoother area recognition; quite the opposite really, as more background noise gets picked up as false positives.

I ran a second OCR process on your original 692MB PDF, but first I extracted all the pages as bitmaps and did some pre-processing on them. I changed the black and white levels to clear up some of the greyness and strengthen the lettering. I then applied a light Gaussian blur followed by a mild sharpen to clean up some of the edges, changed the dpi to 300 and saved them out as JPGs at 50% compression. This gave a total size for all the pages of 530MB.
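For anyone curious what a levels change does to the pixels, here is a minimal sketch. It assumes 8-bit greyscale values held in a plain list, and the black and white points (60 and 200) are made-up numbers for illustration, not the settings actually used on these pages. In practice an image editor or batch tool would do this for you.

```python
def adjust_levels(pixels, black_point, white_point):
    """Stretch 8-bit grey values so black_point maps to 0 and white_point to 255."""
    if not 0 <= black_point < white_point <= 255:
        raise ValueError("need 0 <= black_point < white_point <= 255")
    span = white_point - black_point
    out = []
    for p in pixels:
        scaled = (p - black_point) * 255 / span
        # Anything darker than black_point clips to 0, lighter than white_point to 255.
        out.append(max(0, min(255, round(scaled))))
    return out


# Hypothetical row of pixels: dark ink, mid grey, greyish paper.
row = [40, 60, 95, 200, 230]
print(adjust_levels(row, 60, 200))  # [0, 0, 64, 255, 255]
```

The greyish paper (200 and above) is pushed to pure white and the ink (60 and below) to pure black, which is the "clear up the greyness and strengthen the lettering" effect.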

The JPGs were loaded into ABBYY 9 and auto-recognized, then I manually removed all the graphics, tidied up some of the text boxes and processed everything. I created two PDFs: one was saved out at 200dpi with 50% JPEG compression for the graphics, at 335MB; the other was text only, no graphics, which came to only 2.5MB. So you can see how accurate the text recognition was, I've attached the text-only PDF here. If you have or can get access to a program that can merge layers then you can add this PDF to an existing one of the same size as a text layer and bypass the whole process, but I don't believe there's any free software that will currently do this.

With regard to the 335MB version, I think it's quite readable on my HD monitor up to a magnification factor of 800%. Lowering the JPG quality setting (i.e. compressing harder) results in more artifacting and blurring at that magnification, while reducing the dpi (to say 72dpi) lowers the resolution of the image regardless of the compression level. It's always a balancing act to produce the smallest PDF with the clearest readability, and there is software available that can take a PDF and let you change the dpi, the compression level and the compression method until you get a balance you feel comfortable with before saving it off.
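To put a rough number on what dropping the dpi costs, here is a small sketch. The page size is hypothetical (10 x 15 inches stored at 200dpi), and it assumes the software resamples the image so the physical page size stays the same.

```python
def resampled_size(width_px, height_px, old_dpi, new_dpi):
    """Pixel size after resampling to new_dpi while keeping the same physical page size."""
    scale = new_dpi / old_dpi
    return round(width_px * scale), round(height_px * scale)


# Hypothetical 10 x 15 inch page stored at 200 dpi.
old = (2000, 3000)
new = resampled_size(*old, old_dpi=200, new_dpi=72)
kept = (new[0] * new[1]) / (old[0] * old[1])
print(new, f"{kept:.1%} of the original pixels")
```

On those assumed numbers only about an eighth of the pixels survive, and no compression setting can put that detail back.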

Having checked the resolution of the bitmaps extracted above, all my software is telling me that they are 72dpi, so where ABBYY gets 840dpi from I don't know. I can only guess that it's decided the pages are a certain size (perhaps related to the problems you had with forced page sizing) and done the maths to come up with some OTT figure. Then again, at 72dpi a printed page would be 73x100 inches, or roughly 6 foot by 8 foot, so something is amiss somewhere in the original file(s).
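The arithmetic behind those figures can be checked with a short sketch. The pixel dimensions are not read from the file; they are back-calculated from the 73x100 inch figure at 72dpi, so treat them as an assumption.

```python
def print_size_inches(width_px, height_px, dpi):
    """Implied print size when every inch holds `dpi` pixels."""
    return width_px / dpi, height_px / dpi


# Assumed pixel size, back-calculated from 73 x 100 inches at 72 dpi.
WIDTH_PX, HEIGHT_PX = 73 * 72, 100 * 72  # 5256 x 7200

for dpi in (72, 300, 840):
    w, h = print_size_inches(WIDTH_PX, HEIGHT_PX, dpi)
    print(f"{dpi:>3} dpi: {w:.1f} x {h:.1f} inches")
```

With those assumed dimensions, 840dpi works out to roughly 6.3 x 8.6 inches. The dpi value only changes the implied print size, not the number of pixels, which may be how two programs can report such different figures for the same bitmap.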

While all this does not directly address the problems you are having converting your file, I hope it shows that things are not as bleak as they seem. If you still have the original scans I would suggest batch-changing the dpi of all of them to 300dpi (you can use a free program like Irfanview or ImBatch to do this), creating the PDF yourself, again with a free program such as Compulsivecode's 'Image To PDF or XPS', and resubmitting it. Alternatively you can create a CBR file by RAR-compressing the bitmap files, changing the suffix to .CBR and uploading that instead. OCR will still be performed on it. I don't know if either approach will invoke the OCR if you upload to the Test section, but it's worth trying with a small subset of pages just to see if it does and how accurate the output is.

SB

Attachment: Amended_Buffalo_Text_Only.pdf

Poster: Bri-Elma-NY Date: Aug 25, 2018 11:53am
Forum: faqs Subject: Thank you thank you thank you! (+!!!!!)

Again, a thousand thanks for the reply.

Some of what you said above (well, OK, a LOT of it) is over my head. I'm just a guy who has some old documents and wants to share them with the world. Mainly I deal with medical issues day-to-day, so this document stuff is only a part-time part-time pastime. I very occasionally write articles for the local steam-show newsletter, but that's another very-seldom-time pastime.

I tried the post-PDF compression after creating a PDF using the original 400-dpi scans, and sure enough you can't read the text (i.e., it's too compressed). It'd be nice if they offered different compression levels at SmallPDF.com.

I believe you may've hit the head on the nail with the concept of lightening up of the pages. I usually do that on my other documents, but these pages were scanned off-site, and the folks there tweaked the pages post-scan.

So, I lazily left the JPGs alone because I'd been asked repeatedly for this document in more-legible form, so I was shooting for expeditiousness (gee, whut a surprize, I initiallly spelted that werd wrongedly). For the record, the original scans were 400-dpi TIFFs, which I converted using Picasa. Now I can't find the original TIFFs, not that that'd help.

I'll keep trying. Again, thank you for the much-detailed reply.

/Bri in Elma NY USA

Poster: StarbriteScanz Date: Aug 25, 2018 4:04pm
Forum: faqs Subject: Re: Thank you thank you thank you! (+!!!!!)

I'll keep this one short. If you don't want to continue trying with this then I've created a new PDF of this publication and put it on a 30-day hosting site for you to do with what you want. The URL for it is https://ufile.io/js2gp. The PDF is 83MB but it's the same dimensions as the original, 300dpi, and the image quality is a lot better. There are likely a few spelling errors due to the OCR but the whole thing is fully searchable. I think it was said that if you upload it here the included text layer means it won't be re-processed. (The text layer, BTW, is just the OCR'ed text, which can either be invisible, i.e. "below" the page, or visible, in which case it overwrites the original page.) Hope this helps you out.

SB

Poster: Bri-Elma-NY Date: Aug 25, 2018 4:52pm
Forum: faqs Subject: Re: Thank you thank you thank you! (+!!!!!)

I've attached a screen shot of what I'm attempting. Maybe it'll work.

I tried uploading your PDF to replace the PDF WITH TEXT file.

/Bri...

Attachment: Uploading_the_new_1888_file_screen_shot.JPG

Poster: StarbriteScanz Date: Aug 25, 2018 6:25pm
Forum: faqs Subject: Re: Thank you thank you thank you! (+!!!!!)

I don't think it will let you add your own OCR'ed file, at least not according to this: https://archive.org/about/faqs.php#1165

The best you can probably achieve is to replace the main PDF file and hope that, since it already contains a text layer, a new derive won't produce a second OCR'ed file. I think that's what was said by a mod earlier in this thread?

Poster: Bri-Elma-NY Date: Aug 25, 2018 10:31pm
Forum: faqs Subject: Re: Thank you thank you thank you! (+!!!!!)

If I may mail you a free t-shirt (which advertises my website) as a very-small thanks for the work you performed so kindly, please send shirt size and mailing address to my webmaster email address.

/Bri in Elma NY USA
>webmaster of www.BuffaloPitts.com

Poster: Bri-Elma-NY Date: Aug 25, 2018 7:23pm
Forum: faqs Subject: Re: Thank you thank you thank you! (+!!!!!)

Uh, I believe the replacement "took." Check it out. PDF-page 04 is rotated (as in the file you provided), and the text is highlightable (is that even a word?). Link is below.

https://ia801508.us.archive.org/14/items/1888BuffaloNYIndustrialFairPaper/1888%20Buffalo%20NY%20Industrial%20Fair%20Paper_text.pdf

A wise man once said: "Never tell a stupid person he can't do something because he'll somehow, someway, and quite ineptly, make it happen, often with little effort or issue."

But, will it stay that way? Inquiring nerds want to know.

Again, a thousand thanks.

/Bri in Elma NY USA

Poster: Bri-Elma-NY Date: Aug 25, 2018 4:34pm
Forum: faqs Subject: Re: Thank you thank you thank you! (+!!!!!)

Thank you, but I believe my 30 days are already up... "file not found -- 404." I'm sorry to be a pest.

/Bri...

Poster: Bri-Elma-NY Date: Aug 25, 2018 4:37pm
Forum: faqs Subject: Re: Thank you thank you thank you! (+!!!!!)

Never mind... I copied & pasted the URL and took out the trailing period. It seems to be working... to wit: "slow download for free yada yada."

Bri...